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Brain atrophy, measured by MRI, has been proposed as a useful surrogate marker for disease progression in multiple 
sclerosis (MS). However, it is conventionally assumed that the accurate quantification of brain atrophy is made dif- 
ficult, if not impossible, by changes in the parameters of the MRI acquisition, which are almost inevitable over the 
course of a longitudinal study since MRI technology changes rapidly. This state of affairs can negatively affect clinical 
trial design and limit the use of historical data. Here, we investigate whether we can coherently estimate brain 
atrophy rates in a heterogeneous MS sample via linear mixed-effects multivariable regression, incorporating three 
critical assumptions: (1 ) using age at time of scanning, rather than time since baseline, as the regressor of interest; 
(2) scanning individuals with a variety of techniques; and (3) introducing a simple additive correction for major dif- 
ferences in MRI protocol. We fit the model to several measures of brain volume as the outcome in two MS popula- 
tions: 1123 scans from 195 cases acquired for over approximately 7 years in two natural history protocols (Cohort 
1 ), and 1331 scans from 69 cases seen for over 1 1 years who were primarily treated with two specific MS disease- 
modifying therapies (Cohort 2). We compared the mixed-effects model with additive correction for MRI acquisition 
parameters to a model fit without this correction and performed sample-size calculations to provide an estimate of 
the number of participants in an MS clinical trial that might be required to see a therapeutic effect of treatment using 
the approach described here. The results show that without the additive correction for Tl -weighted protocol param- 
eters, atrophy was underestimated and subject-specific estimates were more narrowly distributed about the popu- 
lation mean. Ventricular CSF is the most consistently estimated brain volume, with a mean of 2.8%/year increase in 
Cohort 1 and 4.4%/year increase in Cohort 2. An interesting observation was that gray matter volume decreased and 
white matter volume remained essentially unchanged in both cohorts, suggesting that changes in ventricular CSF 
volume are a surrogate for changes in gray matter volume. In conclusion, the mixed-effects modeling framework 
presented here allows effective use of heterogeneously acquired and historical data in the study of brain atrophy 
in MS, potentially simplifying the design of future single- and multi-site clinical trials and natural history studies. 

© 201 3 The Authors. Published by Elsevier Inc. All rights reserved. 
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1. Introduction 

Multiple sclerosis (MS) is an immune-mediated disease of the CNS 
that is characterized by focal demyelinating lesions and causing physical 
and cognitive impairment. Although these focal lesions are the most 
striking feature of the disease when MS patients are studied with magnet- 
ic resonance imaging (MRI), brain lesion volume per se is not strongly 
correlated with disability (Filippi et al., 1995; Furby et al., 2010; Kappos 
et al., 1999). Additionally, patients often continue to have progressive 
symptoms while developing few new lesions. 

One possible explanation for progressive disease in MS despite a 
paucity of new lesions is abnormally rapid brain atrophy, which has 
been noted in many studies that have examined MS with MRI (for re- 
view, see Giorgio et al., 2008). Atrophy has its origin in neuronal and 
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glial degeneration, primarily resulting in gray matter loss (Fisher et al., 
2008; Fisniku et al., 2008; Shiee et al., 2012). Gray matter volume has 
been shown to be an independent predictor of disability in all MS sub- 
types (Roosendaal et al., 2011). Therapies such as interferon beta-lb 
(Molyneux et al., 2000), glatiramer acetate (Rovaris et al., 2001), and 
natalizumab (Miller et al., 2007), which are effective at preventing 
new lesions, are ineffective at fully correcting atrophy rate (for review 
of the effect of therapeutics on atrophy, see Zivadinov et al. (2008)). 
Thus, atrophy rate has been postulated as a surrogate marker for disease 
progression and has been used as an outcome measure in many trials 
(several previously mentioned, as well as Filippi et al. (2004)). 

An issue with using brain atrophy as a surrogate for disease progres- 
sion is the inherent difficulty in measuring it. Most studies use normal- 
ized measures of brain volume that fall primarily into two categories. 
Longitudinal methods, such as SIENA, register two images from the 
same individual and look for changes in the brain surface (Smith et al., 
2002). Cross-sectional methods relate the volume of interest to a nor- 
malization volume, such as the skull or intracranial volume, and track 
changes in the relative size of brain structures over time (Chard et al., 
2002). Both types of method have been valuable in the study of atrophy 
in MS but require, or perform best with, MR1 data acquired in a homoge- 
neous fashion with respect to hardware, protocol, and acquisition pa- 
rameters. The reason is that these methods are analogous to studying 
the atrophy rate of each subject separately and averaging the results 
to generate an atrophy rate for the population of subjects. A spurious 
change in apparent brain volume induced by a change in acquisition pa- 
rameters in one or more scans on a particular subject will produce an in- 
accurate measurement of atrophy rate in that subject. A "second stage," 
population-level analysis will then produce a flawed population esti- 
mate of atrophy rate. This technical constraint limits study duration 
and sample size, since imaging technology is in constant flux and differ- 
ent centers often use different equipment and protocols, and means that 
brain atrophy is often estimated from relatively short-term data. How- 
ever, short-term biological variability in brain volume is well-known 
in MS, resulting from variations in factors such as hydration status and 
the effect of anti-inflammatory therapy known as pseudoatrophy, 
which can result in loss of brain volume without actual loss of brain tis- 
sue upon the initiation of treatment (Fox et al., 2005; Mellanby and 
Reveley, 1982; Rao et al„ 2002; Sampat et al., 2010; Zivadinov et al., 
2008). Compounding the problem is that the absolute change in brain 
volume that represents real atrophy is small relative to this biological var- 
iability (De Stefano et al., 2010; Rao et al., 2002). The result - that follow- 
up must essentially start over with every new trial or improvement in 
MR1 equipment or technique — is especially difficult when good quality 
historical control data from MS trials exist (Shirani et al., 2012). 

In the analysis of longitudinal data, linear mixed-effects multivari- 
able regression analysis enjoys several well-described statistical advan- 
tages over the described "two-stage" or so-called "NIH method" (Cnaan 
et al., 1997) (for textbook treatment, see Fitzmaurice et al. (2011)). 
Mixed-effects regression enables the generation of both population- 
level and subject-specific atrophy rates, permits the use of data from 
subjects with few data points, and also allows the quantification of the 
effects on atrophy of variables such as therapy and lesion load at the 
population level. In fact, mixed-effects models have been applied to at- 
rophy rates generated using the aforementioned methods to calculate 
sample sizes for trials (Anderson et al., 2007). 

The goal of this study was to investigate whether longitudinal data 
from multiple scanning protocols can be reasonably integrated into a 
coherent estimate of individual- and population-level brain volume 
changes in MS. We hypothesized that applying linear mixed-effects re- 
gression to absolute brain volumes, rather than to atrophy rates gener- 
ated separately for each patient, would allow us to correct for the effect 
of MRI acquisition differences at the population level using an additive 
factor. To accomplish this, we postulate that regression should be 
performed against age at the time of scanning (rather than time elapsed 
since the first scan, with age as covariate), and also that scans with 



different protocols be performed on individual subjects. We show that 
this approach can generate reasonable results for atrophy rate in two 
populations of MS patients with intra-subject variation in MRI acquisi- 
tion parameters — Cohort 1, a collection of heterogeneous longitudinal 
data from two MS natural history protocols, including scans performed 
off and on disease-modifying therapy; and Cohort 2, derived from a 
retrospective analysis of data obtained from MS cases seen at our 
center over an 11 -year period who were predominantly treated with 
daclizumab, an anti-CD25 monoclonal antibody, and interferon beta. 
An important implication of this work is that it opens the doors for com- 
parison of historical data to data more recently acquired, particularly if 
individual subjects were studied with a spectrum of protocols, thereby 
decreasing the impact of small sample sizes and short-term follow-up 
durations on studies of brain atrophy. 



2. Materials and methods 

2.1. Cohort 1 

This cohort includes a convenience sample of MS cases enrolled in 
our center between October 2005 and September 2011, who were 
being evaluated in the context of two ongoing natural history protocols 
(the last scan in this cohort was performed in December 201 1 ). We ini- 
tially analyzed 1491 MRIs from 373 cases using our automated image- 
processing software (see below). Of these, we excluded 111 cases 
with only one MRI, as we were only interested in longitudinal changes 
in brain volume. Of the remaining 262 cases, we excluded 3 because 
our software failed to provide reasonable estimates of lesion volume 
from their images. We also removed 64 cases because their final diagno- 
sis was not MS according to 2005 and 2010 McDonald criteria (Polman 
et al., 2005, 2011). A final pool of 1123 MRIs from 195 cases remained 
for the analysis. Of these scans, 23 were later removed because they 
had extreme ventricular sizes that precluded reliable quantification of 
brain volume (see Section 2.4). Table 1 contains a summary of demo- 
graphic and clinical characteristics; a summary of MRI acquisition pa- 
rameters may be found in Table 2. Subjects in Cohort 1 were treated 
with a variety of disease modifying therapies, including interferon 
beta 1 a and lb, glatiramer acetate, daclizumab, natalizumab, idebenone, 
mitoxantrone, and fingolimod; 773 scans were from subjects treated 
with any disease-modifying agent during the observation period, 
while 327 were from those never treated during this period. Modeling 
of individual treatment effects was beyond the scope of this work. 
Differences in follow-up length were not considered in our analysis, 
as a multivariable regression of age, sex, EDSS, and disease type on 
follow-up length (results not reported) yielded no association between 
any of these factors and follow-up length. 



Table 1 

Relevant patient sample characteristics for both Cohort 1 , a convenience sample of MS 
patients from two natural history protocols, and Cohort 2. patients whose data were 
used in a post hoc analysis of the therapeutic effect of daclizumab in MS. MS: RR, 
relapsing-remitting; PP, primary progressive; SP, secondary progressive; CIS, clinically 
isolated syndrome. 



Data set 


Cohort 1 


Cohort 2 


N (by sex) 


116 F,79 M 


44 F, 25 M 


Number of scans 


1123 


1331 


Mean age (range) 


42 years (17-68 years) 


38 years (18-60 years) 


Median scans per 


4 (2-27) 


15 (2-55) 


person (range) 






Mean total follow-up 


1.2 years (21 days-5.2 years) 


5.9 years (1.1-10.5 years) 


time (range) 






Disease type 


RRMS: 142; PPMS: 34; 


All RRMS 




SPMS: 12; CIS: 7 




Median EDSS (range) 


1.5 (0-8.5) 


1.5 (0-6.5) 
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Table 2 

MR1 acquisition parameters. Nominal in-plane resolution was approximately lxl mm for all scans. T1WP, Tl -weighted protocol; Vol, volume coil; 8-ch, 8-channel head coil; 
TR, repetition time; TE, echo time; Tl, inversion time; F, FSPGR (fast spoiled gradient echo); M, MPRAGE (magnetization-prepared rapid acquisition of gradient echoes). 



Cohort 1 Cohort 2 



T1WP: 


1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


1 


2 


3 


4 


Number of scans 


5 


335 


2 


1 


68 


305 


5 


289 


30 


83 


188 


996 


38 


110 


Scanner manufacturer 


CE 


CE 


CE 


CE 


CE 


CE 


CE 


CE 


CE 


Philips 


CE 


CE 


CE 


CE 


Field strength (T) 


1.5 


1.5 


1.5 


1.5 


1.5 


1.5 


1.5 


3 


3 


3 


1.5 


1.5 


1.5 


1.5 


Receive coil 


Vol 


Vol 


Vol 


Vol 


8-ch 


8-ch 


8-ch 


8-ch 


8-ch 


8-ch 


Vol 


Vol 


8-ch 


8-ch 


TR (ms) 


12 


9-10 


7 


8 


10 


9 


8 


9 


8-9 


7 


~9 


~9 


-10 


-9 


TE (ms) 


5 


2-3 


3 


3 


3 


3.5 


3 


3.5 


3 


3 


~2 


-2 


-3 


-3.5 


Tl (ms) 


None 


None 


400 


750 


None 


450 


750 


450 


725 


900 


None 


None 


None 


450 


Flip angle (deg) 


20 


20 


12 


16 


20 


13 


16 


13 


6 


9 


20 


20 


20 


13 


Slice thickness (mm) 


1.2 


1.4 


1.5 


1 


1.4 


1.5 


1 


1 


1 


1 


1.3 


1.4 


1.4 


1.5 


Acquisition protocol 


F 


F 


F 


F 


F 


F 


F 


F 


M 


M 


F 


F 


F 


F 



2.2. Cohort 2 

Cohort 2 consisted of MS cases whose data were used in a post hoc 
analysis of the therapeutic effect of daclizumab (Borges et al., 2013). 
Seven of the 69 cases in Cohort 2 were also in Cohort 1 (comprising 
only 20 scans). Table 1 contains clinical and demographic characteristics 
of these data, while Table 2 shows the MR1 acquisition parameters. 
There were 26 cases treated with daclizumab and 43 non-daclizumab 
cases; the effect of therapy was not considered here. 

2.3. Image processing 

Lesion-TOADS is a fully automated, topology-preserving brain seg- 
mentation algorithm based on image intensity and a statistical prior 
probability atlas that enables simultaneous tissue classification and MS 
lesion detection and compares favorably with other segmentation tech- 
niques (Shiee et al„ 2010). The algorithm uses both a Tl-weighted 
image and a T2 -FLAIR image to generate a classifier image and accounts 
for the effect of lesions on signal intensity in the Tl-weighted image. 
The output from lesion-TOADS includes brain volumes for each of the 
17 tissue types it considers, as well as the lesion volume. The Tl- 
weighted images are used for brain structure segmentation, whereas 
the T2-FLAIR images are used for lesion segmentation. In this study, 
our tissue types of interest were ventricular CSF (vCSF) as well as 
supratentorial gray matter (sGM), white matter (sWM), and total vol- 
ume (sTot). Average lesion volume, calculated as described below, 
was also considered. sWM volume included both lesions and normal- 
appearing white matter. 

Before applying lesion-TOADS, Tl-weighted and T2-FLA1R im- 
ages were inhomogeneity corrected using the N3 algorithm (Sled 
et al., 1998), and the chronologically middle image was rigidly 
registered to the MNI-152 space; this middle image was also used 
as the target for registration of all other images for that subject. A 
mean image was then computed by voxel-wise averaging across 
all of the subject's scans. The skull was removed automatically 
using SPECTRE (Carass et al., 2007), generating a brain-and-CSF 
mask for each individual scan. Each image was normalized by 
subtracting the mean whole brain intensity and dividing by its stan- 
dard deviation (Shinohara et al., 2011), and the images were then 
re-registered to the mean image. Lastly, the histograms for each 
image were matched to the histogram of the mean image using 
the "Histogram Image Matching" algorithm in MIPAV (medical 
image processing, analysis, and visualization, http://mipav.cit.nih. 
gov). Lesion-TOADS was then applied to each image. 

A single lesion volume was generated for each case by averaging all 
volumetric Tl-weighted images and all T2-FLAIR images from that case 
and applying lesion-TOADS to the average images. In our experience, 
this produced a more reliable single estimate of overall lesion volume 
than averaging the lesion load estimates from individual scans, at the 
cost of losing the ability to track lesion volume over time and possibly 



inducing an overestimation of lesion volume (as lesions tend to accu- 
mulate over time). 

2.4. Statistical methods 

A linear multivariable mixed-effects regression model was applied 
separately to each tissue type of interest to quantify the rate of volume 
change: 

Voly = ft + ft * AGEy + ft * SEX, + ft * LESION ] + ft * 71 WP {j + b oi 
+ b u *ACE ij + E i j. 

In this model, the predicted volume of interest for subject i at time 
j (Voly) is determined by population ("fixed") effects, represented by 
ft.. .ft, and subject-specific ("random") effects, b oi and b Ai . The average 
rate of change in the population is ft, while the subject-specific differ- 
ence in that rate is captured by b u . Sex and Tl-weighted protocol 
(T1WP) were modeled as indicator variables. T2-FLAIR protocol is not 
included because the T2-FLAIR image is predominantly used by lesion- 
TOADS for delineation of lesions rather than of brain tissue structure. 
Differences in T1WP were captured in a descriptive "code" generated 
from the scanner manufacturer, magnetic field strength, receive coil, ac- 
quisition type (fast spoiled gradient echo [FSPGR] vs. magnetization- 
prepared rapid acquisition of gradient echoes [MPRAGE]), inversion 
time (Tl), and slice thickness. Data are presented with this correction 
applied, so that model fits can be shown as straight lines. Age in years 
and lesion load (LESION) are continuous variables. The model was fitted 
once to vCSF volume data, and an empirical threshold for the residuals 
was established (absolute value of the residual >3.75 ml). Scans with 
data above this cutoff were found to be those with extreme ventricular 
sizes. These scans were removed (23 scans), and the model was refit for 
each structural volume of interest. The rationale for this is that lesion- 
TOADS most often fails on scans with extremely large or small ventricles 
due to its use of a statistical atlas for tissue types. The results of the 
model were compared to those obtained when fitting the same model 
but removing Tl WP; a likelihood ratio test was performed to compare 
the two models. For vCSF volume in Cohort 1, results for fitting a 
model including a fixed effect for the square of age as well as a model 
including a fixed effect for ever treated vs. never treated during the ob- 
servation period are also reported. Additionally, the effect of using age 
as the time variable was compared to structuring the model with time 
elapsed since first enrollment as the time variable and age as a covariate. 
For Cohort 2, longitudinal lesion load was not included in the previously 
reported modeling results (Borges et al., 2013). 

The distribution of vCSF volumes in the population as a whole was 
found to be right-skewed in both datasets, so a natural logarithmic trans- 
formation was applied before model fitting. Therefore, atrophy rates 
for vCSF are given as %/year. A normal distribution reasonably approxi- 
mated all of the supratentorial tissue measures, so no transformation 
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was applied to these data, and atrophy rates are reported as absolute rates 
in ml/year. 

Statistical analysis was performed in R version 2.14.1 (R 
Development Core Team, 2011). The lme4 package version 
0.999375-42 (Bates et al., 2011) was used for the fitting of mixed 
effects models using restricted maximum likelihood estimation. 
Graphics were produced using the ggplot2 package, version 0.9.1 
(Wickham, 2009). In lieu of p values, the 95% confidence interval 
is reported for calculated values as appropriate, since many com- 
parisons were performed and showing "statistical significance" of the 
results is not the goal of these analyses. 

2.5. Sample size calculations 

The sample sizes necessary in each ami of a hypothetical therapeutic 
trial to detect effects of 25%, 50%, and 75% of the therapy on the atrophy 
rate were calculated using the following formula (Diggle et al., 2002): 

N = \\ ( z i-«/2 + z i-p) 2 (°w + 2 f) 



where N represents the number of subjects per trial arm necessary to 
detect a change in atrophy rate y, assuming variance of the slopes o 2 ,^ 
and variance of the residuals cr E 2 , both of which were estimated from 
our sample. Z-scores for a = 0.05 and power = 1 — 0 = 80% were 
used. Results were calculated for a simplified trial with an initial scan 
and one follow-up scan; follow-up scan times of one and two years 
were compared. 

3. Results 

Fig. 1 shows representative Tl-weighted and T2-FLA1R images from 
one case at a single time point, as well as the lesion-TOADS classifier 
mask associated with them. 

Mixed-effects models with and without Tl WP correction were fitted 
to the data collected from Cohort 1 , a heterogeneous population of MS 
cases (see Materials and methods). The effect of including an additive 
correction factor based on a code that captures the variability associated 
with 6 parameters of the MR1 acquisition is demonstrated in Fig. 2. The 
additive corrections for each T1WP code were calculated from the 
mixed-effects model fits and are constrained by the population-level 
effects of age, sex, and total lesion volume, as well as the clustering of 
data by subject. These correction factors were then explicitly added to, 
or subtracted from, the original measurements in order to create the 
TlWP-corrected values shown in the plots. Other acquisition types 
were corrected to protocol 1 (see Table 2). The mixed effects fits are 
shown with Tl WP-correction (black lines) and without it (red lines). 




Fig. 1. Representative A) Tl-weighted (FSPGR), B) T2-FLAIR, and C) lesion-TOADS 
classification images. The lesion-TOADS image shows classification of cortical gray 
matter (orange), white matter (white), ventricles (brown), and lesions (red). (For 
interpretation of the references to color in this figure legend, the reader is referred 
to the web version of this article.) 



The effect of T1WP correction is explored in two individual subjects 
in Fig. 3, which includes notation of the combinations of acquisition 
parameters for each scan (with reference to Table 2). 

The population estimates of atrophy rate, as well as the variability in 
the subject-specific estimates as quantified by the standard deviation, 
are given in Table 3. Additionally, Table 3 provides results for each tissue 
type when the model includes time elapsed since enrollment/first scan 
as the time variable, with age as covariate. Note that likelihood ratio 
tests comparing the model results without and with the T1WP correc- 
tion were all highly significant (p < 2.2 x 10~ 16 ), probably due to the 
high number of observations and relatively small number of terms in 
the model. Moreover, residual variance is substantially lower for the 
models that include the T1WP correction. 

The results indicate that without the correction for Tl WP, the model 
assigns a lower population rate of vCSF increase and a higher rate of 
sWM decrease relative to the results obtained with the T1WP correc- 
tion. Stability over time of sWM volume, as determined by the TlWP- 
corrected model, is consistent with prior studies (Pirko et al., 2007; 
Shiee et al., 2012). Moreover, the uncorrected models yielded a 
narrower distribution of subject-specific atrophy rates around the pop- 
ulation mean, whereas the TlWP-corrected model closely approximat- 
ed individual subject trends, particularly for vCSF. Age and time elapsed 
both provide similar estimates of atrophy for vCSF when used as the 
time variable, although the variance in the estimate is larger when 
using time elapsed. For sGM and sWM, however, using age more strong- 
ly constrains the estimate of atrophy rate in the face of the high variabil- 
ity in these data. Residual variance is similar for these two classes of 
model. 

A model including a fixed effect for the square of age reduced the 
utility of the age predictor while not adding great predictive value to 
the model for vCSF data. In particular, in the quadratic model the atro- 
phy rate with age was 2.0%/year [95% CI: —1.4, 4.3] and with age 2 
0.011%/year [-4.2, 4.4]; the SD of atrophy rates was 2.6%/year. Hence, 
we performed further analysis without the quadratic term. A fixed effect 
of ever vs. never treated with any disease-modifying agent during the 
observation period did not add predictive value to the model (regres- 
sion coefficient for treatment: 1 5%/year [ — 1 .3,35] ) and was also exclud- 
ed from further analyses. 

Fig. 4 shows the mixed-effects fits with and without T1WP correc- 
tion for the cases from Cohort 2. To facilitate comparison to Cohort 1, 
no treatment effect of daclizumab was considered, unlike in Borges 
et al. (2013), where daclizumab therapy was shown to reduce the rate 
of brain volume change relative to interferon beta. Population atrophy 
rate and within-subject variability estimates for Cohort 2 are found in 
Table 4. In this cohort, long follow-up and relatively low within- 
subject variability (due in part to the use of more similar acquisition 
protocols) led to reasonable individual- and population-level estimates 
for both vCSF and sTot. As in Cohort 1 , atrophy was underestimated, and 
subject-specific estimates were more narrowly distributed about the 
population mean, in the absence of T1WP correction. Note that we did 
not separate the data into gray and white matter components for Cohort 
2, because gray-white segmentation was not particularly reliable in this 
dataset, where most scans were acquired at 1 .5 T with a volume receive 
coil (Borges et al., 2013). 

To study the effect of the differing within-subject variability be- 
tween vCSF and sGM in Cohort 1 on the ability to detect changes in at- 
rophy rate, we performed calculations to estimate the sample sizes in 
each arm of therapeutic trials necessary to detect 25%, 50%, and 75% dif- 
ferences in atrophy rate. A simplified trial design of one initial scan and 
one follow-up scan at either one or two years was used (Table 5). The 
results suggest that about 18 times as many cases are required to detect 
changes in atrophy rate for sGM volume as compared to vCSF volume 
for the one-year trial, whereas about 10 times as many cases are neces- 
sary for the two-year trial. Note that these sample sizes are provided for 
illustration only, as it is highly unlikely that a clinical trial would be 
designed using a cohort and acquisition parameters similar to ours. 
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Fig. 2. The effect of Tl -weighted protocol (T1WP) correction in 195 MS cases in Cohort 1 (natural histoiy protocol; individual cases represented with different hues of blue and green). 
A) Log-transformed ventricular cerebrospinal fluid (CSF) volume and fits generated by the mixed effects model that did not incorporate T1WP correction (black lines) B) The fits (red 
lines) of the model with T1WP correction for ventricular CSF; the point-by-point data were also corrected for T1WP. C) and D) are analogous plots for supratentorial gray matter (GM) 
volume, while E) and F) show the same for supratentorial white matter (WM) volume. (For interpretation of the references to color in this figure legend, the reader is referred to the 
web version of this article.) 



4. Discussion 

Brain atrophy in MS is a well-studied phenomenon that is thought to 
reflect neuroaxonal loss and that, as a result, has been postulated as a 
surrogate measure of disease progression. Our results demonstrate 
that mixed-effects modeling of absolute brain volumes can robustly 
allow the use of larger, more heterogeneous datasets to improve esti- 
mation of atrophy rates in MS populations. The methodology presented 
here may be further tested in large-scale, multi-site clinical trials. 



Specifically, our framework incorporates: (1) explicit integration of 
age as the independent variable of interest, given clear population-wise 
(and essentially linear) associations of brain structure volumes with 
age; (2) simple statistical modeling of heterogeneous data acquisition 
protocols to reduce uninformative variability in the data, which is facili- 
tated by acquiring data from individual subjects using a variety of proto- 
cols; and (3) generation of subject-specific rates of brain volume change. 

The goal of this study was not to define the optimal model for inte- 
grating heterogeneous data into a single analysis. Nevertheless, it is 
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Fig. 3. The effect of Tl -weighted protocol (T1WP) correction in A) a representative subject 
who underwent 18 scans for over 2 years with 4 different Tl-weighted protocols and 
B) a second representative subject who underwent 10 scans in 2 years and 4 months. 
Plots show log-transformed ventricular cerebrospinal fluid (CSF) volume; the green line 
represents uncorrected data, while the blue line represents data with additive T1WP 
corrections applied. The number on each uncorrected point represents protocol number, 
as described in Table 2. (For interpretation of the references to color in this figure legend, 
the reader is referred to the web version of this article.) 

interesting to note that even the simple additive correction used here 
substantially reduces within-subject variability. In-depth modeling of 
the effects of individual scanning parameters is beyond the scope of 
this study but could certainly be incorporated to further reduce the var- 
iability and improve estimation of atrophy rates. Also beyond the scope 
of this study is determining the specific numbers of subjects and scans 
per protocol that would be necessary to apply the methods described 
here. Although the success of the method would certainly increase 



with the number of subjects and scans, in our experience simple inspec- 
tion of the corrected volume trajectories should yield a good qualitative 
sense of the success or failure for a particular dataset. An additional re- 
finement of the model that includes a fixed effect for the square of age in 
estimation of atrophy rates did not have increased predictive power, 
consistent with a linear atrophy rate over time as seen in several reports 
(Martola et aL, 2010; Shirani et al., 2012). Further refinements of the 
model to test the effects of lesion load and disease subtype, as well as in- 
teractions of scanning protocol with time, could proceed under the 
same framework. Additionally, the effect of specific disease-modifying 
therapies could be investigated in much more detail. A simplistic ever 
vs. never treated model in Cohort 1, a convenience sample, did not in- 
crease predictive power, but applying a similar but more specific 
model in Cohort 2 yielded a clear effect of daclizumab therapy (Borges 
et al.,2013). 

We note that the proposed framework can be applied to any scalar, 
continuous variable: any measured quantity derived from imaging or 
other data can be used as the outcome variable, as long as the variable 
(or a transform thereof) varies deterministically with age across the 
population. Our data suggest that using age as the time variable in 
data with relatively lower variability (e.g., vCSF) provides estimates 
similar to those obtained using time elapsed since enrollment (although 
using age does provide a narrower confidence interval even in this 
case). However, as variance in the data increases, as occurs for sGM 
and sWM volumes, using time elapsed provides too weak a constraint 
upon the high variability between scans to tease out the population- 
level atrophy rate. 

These considerations make the mixed-effects modeling framework 
proposed here a very strong option in the population-level analysis of 
brain atrophy in MS, especially when long-term data are available. The 
model could even conceivably be applied in clinical practice to make in- 
ferences about individual patients whose historical images were collect- 
ed with varying acquisition parameters — a very common situation, 
particularly at referral centers where patients bring prior MRl scans at 
the time of first evaluation. For example, a particular MS center might 
apply the model to determine a correction factor for brain volume for 
each scanner and protocol combination, which would allow the inter- 
pretation of brain volume changes on the subject level over long periods 
of time. 

One interesting outcome of this study is the suggestion that vCSF 
volume can stand in for sGM volume, since sWM volume appears to 
change very little (if at all) over time. There is little doubt that primary 
vCSF expansion is unlikely in MS — rather, changes in vCSF are primarily 
due to the loss of sGM, the primary component of atrophy in MS (Fisher 
et aL, 2008; Fisniku et al., 2008; Shiee et al., 2012). Of note, we did not 
attempt to separate different gray matter structures, such as cortical 
vs. deep gray matter, and we did not examine changes in infratentorial 
gray matter volume. These insights are supported by a recent study 
showing that lesion volume is correlated more strongly with global 
than regional atrophy, as well as a study describing the ability of lateral 
ventricle volume to differentiate stable from progressive patients 
(Antulov et al., 2011 ; Horakova et al., 2009). 



Table 3 

Estimates of atrophy rates, with associated 95% confidence intervals (CI) as well as standard deviations (SD) for subject-specific slopes, using the mixed-effects model with and without 
T1WP correction for ventricular cerebrospinal fluid (vCSF), supratentorial gray matter (sGM), and supratentorial white matter (sWM) volumes in Cohort 1. The residual variance of the 
model is also reported. The last row shows the calculated atrophy rates when time elapsed since enrollment is used as the time variable in the mixed-effects model with age as a covariate. 
T1WP, Tl-weighted protocol. 



Model 


vCSF rate 


SD of vCSF rates 


Residual variance 


sGM rate 


SDofsGM 


Residual 


sWM rate 


SD ofsWM 


Residual 




[95% CI] 




[(log-mm 3 ) 2 ] 


[95% CI) 


rates 


variance 


[95% CI] 


rates 


variance 


T1WP correction 


2.8%/year 


±2.6%/year 


1.6 x 10~ 3 


— 2.1 ml/year 


±0.77 ml/year 


190 ml 2 


0.18 ml/year 


±0.70 ml/year 


310 ml 2 




[2.1,3.5] 






[-2.7,-1.4) 






[-0.42, 0.78) 






No Tl WP correction 


0.72%/year 


± 1 .7%/year 


3.6 x 10~ 3 


— 2.1 ml/year 


±0.67 ml/year 


610 ml 2 


— 0.46 ml/year 


±0.55 ml/year 


540 ml 2 




[0.15, 1.3] 






[-2.8, -1.5) 






[-1.0,0.11) 






T1WP correction, time 


2.9%/year 


±4.8%/year 


1.3 x 10~ 3 


0.47 ml/year 


±5.2 ml/year 


160 ml 2 


0.14 ml/year 


± 5.6 ml/year 


280 ml 2 


since enrollment 


[1.5,4.2] 






[-2.0,2.9] 






[-2.5, 2.8] 
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Fig. 4. The effect of Tl -weighted protocol (T1WP) correction in 69 MS cases in Cohort 2 (cases used in a post-hoc analysis of daclizumab (Borgeset al, 2013); individual cases represented 
with different hues of blue and green). A) Log-transformed ventricular cerebrospinal fluid (CSF) volume and fits generated by the mixed effects model that did not incorporate Tl WP cor- 
rection (black lines). B) The fits (red lines) of the model with T1WP correction for ventricular CSF; the point-by-point data were also corrected forTlWP. C) and D) are analogous plots for 
supratentorial total volume. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) 



The use of the percentage change in vCSF (i.e., performing a log- 
transformation on the data before applying the model) is reasonable 
since the changes in vCSF are small relative to the vCSF volume and 
the ventricular volume cannot be negative. An exponential model ad- 
justs for the increasing variability between older subjects. Since sulcal 
CSF is difficult to estimate, as it requires consistent and accurate "skull 
stripping," we do not consider it when using vCSF. However, this omis- 
sion of sulcal CSF could lead to incorrect estimation of the atrophy rate. 

The main limitation of statistical modeling with regression in our 
study is that there is no way to prove that the results obtained represent 
the "ground truth." Important effects may not be accounted for; for ex- 
ample, although the purpose of our study was not to measure a differ- 
ence in atrophy rate with different therapeutic modalities, therapy 
does affect atrophy rate (Borges et al., 2013). A long-term follow-up 
study taking place on identical hardware with identical protocols and 
analyzed with the same method presented herein would be the ideal 
way to verify that our method generates atrophy rates similar to those 
measured by established methods, but such data are rare and were 



not available to us. Indeed, as the difficulty in conducting such a study 
is the problem this research is designed to circumvent, it is easy to see 
the challenge in comparing these results to a "gold-standard" approach. 
Additionally, with variability in the methods used to estimate brain tis- 
sue volume as well as the natural variability in atrophy between MS pa- 
tients, it is unclear what meaning can be ascribed to minor differences in 
the estimation of atrophy rate seen between this and prior atrophy 
studies. 

Short-term or highly variable data, such as the sGM and sWM data 
from Cohort 1, lead to less accurate subject-specific estimates, regard- 
less of the model being used. In this case, the mixed-effects model esti- 
mates atrophy in a similar fashion to a population-level regression. 
Although this is a technically valid way to generate a population esti- 
mate of atrophy rate using these data, the variability present makes 
the utility of the calculated subject-specific rates unclear. Our model 
does perform better with longer follow-up length, as can be seen by 
comparing Cohorts 1 and 2. One prior study (Hughes et al., 2012) 
found that strong within-cohort rank stability in the EDSS score was 



Table 4 

Estimates of atrophy rates, with associated 95% confidence intervals (CI) as well as standard deviations (SD) for subject-specific slopes, using the mixed-effects model with and without 
T1WP correction for ventricular cerebrospinal fluid (vCSF) and total supratentorial (sTot) volumes in Cohort 2. The residual variance of the model is also reported. T1WP, Tl -weighted 
protocol. 



Model 


vCSFrate[95% CI] 


SD vCSF rates 


Residual variance [(log-mm 3 ) 2 ] 


sTot rate [95% CI) 




SD sTot rates 


Residual variance 


T1WP correction 


4.4%/year [3.3, 5.4] 


± 4.0%/year 


2.9 x 10~ 3 


-4.8 ml/year [-6.5, 


-3.2] 


± 6.5 ml/year 


540 ml 2 


No T1WP correction 


4.0%/year [3.0, 5.0] 


±3.8%/year 


3.1 x 10~ 3 


-3.5 ml/year [-5.1, 


-1.9] 


± 6.2 ml/year 


660 ml 2 
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Table 5 

Sample size (SS) necessary in each arm of a therapeutic trial to detect differences of 25%, 
50%, and 75% in the atrophy rate using ventricular CSF (vCSF) volume and supratentorial 
gray matter (sGM) volume, based on the variance of the slopes and residuals from the 
mixed-effects model fit to Cohort 1 (see text for formula). 



Trial length 


% diff. in atrophy rate 


vCSFSS 


sCMSS 


1 year 


25 


160 


2828 




50 


40 


707 




75 


18 


315 


2 years 


25 


65 


712 




50 


17 


178 




75 


8 


80 



only seen after 4 years of follow-up in a population of all MS sub- 
types, illustrating that this problem is not limited to our model 
alone and arises from natural variability in the disease and in the 
methods of assessing it. At the same time, the need for longer 
follow-up length competes with other considerations when design- 
ing and conducting clinical trials of MS therapeutics that use brain 
atrophy as a surrogate outcome. 

Sample-size calculations using our data support the importance of 
follow-up length. Specifically, we found that, after adjusting for hetero- 
geneously acquired MR1 data, increasing follow-up length provides a 
decreased sample size requirement for MS clinical trials. These sample 
sizes are similar to or even improved over other published reports 
(Altmann et al., 2009; Anderson et al., 2007). As discussed previously, 
however, high within-subject variance (resulting in high residuals in 
the formula used to calculate sample size) necessitates a very large 
sample size and makes the use of direct estimation of the gray matter 
volume impractical with our method. It is possible that more homoge- 
neously acquired MRI data, coupled with improved cross-sectional 
and longitudinal (Dwyer et al., 2013) segmentation methods, could 
greatly decrease the sample size requirement for sGM, but, as discussed, 
we suggest that vCSF volume is a practical surrogate measure for sGM. 
In addition to the fact that the data used for the sample-size calculation 
presented here are not likely to be replicated in any real clinical trial, a 
realistic trial would also have more than one time point. Unfortunately, 
for the random intercept/random slope mixed-effects model presented 
here, there is no known formula to analytically calculate the sample 
sizes, necessitating numerical simulations that are beyond the scope of 
this paper. Briefly, with a model similar to ours as the starting point, a 
fixed treatment effect would be added and levels of this effect on atro- 
phy rate chosen for study. Then, simulated data sets for the time points 
chosen for the trial would be created with increasing sample size; the 
sample size at which the power to reject the null hypothesis exceeded 
a predetermined level would be the sample size necessary for the trial. 

One reason that uniform MRI acquisition protocols are generally pre- 
ferred is that current segmentation techniques tend to perform poorly 
when tissue contrast changes, even slightly, as a result of variation in ac- 
quisition parameters. Although lesion-TOADS allows a direct estimation 
of brain volume from cross-sectional data, the between-scan variability 
in the estimation of the gray-white border that arises when acquisition 
parameters change is high enough to result in poor subject-specific 
model fits, especially when the data are not collected over a sufficiently 
long follow-up period. Although we apply a statistical control for proto- 
col type in our model, very small changes in contrast that are not reli- 
ably related to a particular protocol, or that are not accounted for in 
our Tl WP code, can still have a large effect on the segmentation results. 
This property of the segmentation, combined with much higher tissue 
contrast between ventricles and brain than between gray and white 
matter, accounts for the more reliable estimation of vCSF volume than 
sGM or sWM volumes. Our use of average lesion volume rather than le- 
sion volumes at individual time points arises from the same issue, and is 
limiting in that patients with longer disease durations are likely to have 
more lesions. This problem highlights the need for longitudinal segmen- 
tation techniques that consider all the information available for a given 



subject when segmenting tissue types; one such method was recently 
presented ( Dwyer etal.,2013).A "smarter" segmentation would recog- 
nize, for example, that an outlier result from a particular scan that 
shows excessive change in gray matter volume compared to prior 
scans is probably spurious. Such a method might increase the ability 
of multiple scans in a shorter period of time to detect treatment effects 
or predict long-term outcomes, possibly lessening the necessary follow- 
up length and increasing the speed at which new, effective therapies 
can reach the patients who need them. 

5. Conclusions 

Linear, multivariable mixed-effects regression can be successfully 
applied to longitudinal MRI data from multiple scanning protocols in 
order to generate a coherent estimate of population-level brain volume 
changes in MS. This method would therefore allow the use of data ac- 
quired with non-uniform MRI acquisition parameters, potentially in- 
creasing sample size and follow-up length in future MS trials using 
atrophy as an endpoint and possibly allowing multicenter trials to over- 
come the obstacle of differing hardware at different sites. Ventricular 
CSF volume is the most reliably measured brain volume, and our results 
support other studies that report that gray matter is primarily responsi- 
ble for atrophy, making the ventricles a more easily quantified surrogate 
to measure gray matter loss. We believe that mixed-effects modeling of 
absolute brain volumes can allow a wider use of available data in the 
study of brain atrophy in MS. 
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